Uncovering the Hidden Dynamics of Video Self-supervised Learning under Distribution Shifts

Neural Information Processing Systems

Specifically, we pose and answer the following questions: Q1. How do the learned spatial and temporal representations vary across different VSSL pretraining methodologies? How robust are these representations to different distribution shifts?


Dual Principal Component Pursuit: Improved Analysis and Efficient Algorithms

Neural Information Processing Systems

Recent methods for learning a linear subspace from data corrupted by outliers are based on convex L1 and nuclear norm optimization and require the dimension of the subspace and the number of outliers to be sufficiently small [27]. In sharp contrast, the recently proposed Dual Principal Component Pursuit (DPCP) method [22] can provably handle subspaces of high dimension by solving a non-convex L1 optimization problem on the sphere. However, its geometric analysis is based on quantities that are difficult to interpret and are not amenable to statistical analysis. In this paper we provide a refined geometric analysis and a new statistical analysis that show that DPCP can tolerate as many outliers as the square of the number of inliers, thus improving upon other provably correct robust PCA methods. We also propose a scalable Projected Sub-Gradient Descent method (DPCP-PSGD) for solving the DPCP problem and show it admits linear convergence even though the underlying optimization problem is non-convex and non-smooth. Experiments on road plane detection from 3D point cloud data demonstrate that DPCP-PSGD can be more efficient than the traditional RANSAC algorithm, which is one of the most popular methods for such computer vision applications.
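The DPCP objective minimizes the L1 norm of a candidate normal vector's correlations with the data over the unit sphere. A minimal NumPy sketch in the spirit of DPCP-PSGD follows; the step-size schedule, gradient normalization, and function names are our own simplifications, not the authors' implementation:

```python
import numpy as np

def dpcp_psgd(Y, mu0=0.1, beta=0.95, iters=200, seed=0):
    """Projected sub-gradient descent for min_{||b||=1} ||Y^T b||_1.

    Y: (D, N) matrix whose columns are data points (inliers + outliers).
    Returns a unit vector b approximately normal to the inlier subspace.
    """
    rng = np.random.default_rng(seed)
    b = rng.standard_normal(Y.shape[0])
    b /= np.linalg.norm(b)
    mu = mu0
    for _ in range(iters):
        g = Y @ np.sign(Y.T @ b)            # sub-gradient of ||Y^T b||_1
        g /= max(np.linalg.norm(g), 1e-12)  # normalize for a stable step
        b = b - mu * g                      # descent step
        b /= np.linalg.norm(b)              # project back onto the sphere
        mu *= beta                          # geometrically decaying step size
    return b
```

With noiseless inliers spanning a hyperplane and comparatively few outliers, the recovered b aligns with the hyperplane's normal; the geometric step decay mirrors the linear convergence established in the paper.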



Measuring what Matters: Construct Validity in Large Language Model Benchmarks

Bean, Andrew M., Kearns, Ryan Othniel, Romanou, Angelika, Hafner, Franziska Sofia, Mayne, Harry, Batzner, Jan, Foroutan, Negar, Schmitz, Chris, Korgul, Karolina, Batra, Hunar, Deb, Oishi, Beharry, Emma, Emde, Cornelius, Foster, Thomas, Gausen, Anna, Grandury, María, Han, Simeng, Hofmann, Valentin, Ibrahim, Lujain, Kim, Hazel, Kirk, Hannah Rose, Lin, Fangru, Liu, Gabrielle Kaili-May, Luettgau, Lennart, Magomere, Jabez, Rystrøm, Jonathan, Sotnikova, Anna, Yang, Yushi, Zhao, Yilun, Bibi, Adel, Bosselut, Antoine, Clark, Ronald, Cohan, Arman, Foerster, Jakob, Gal, Yarin, Hale, Scott A., Raji, Inioluwa Deborah, Summerfield, Christopher, Torr, Philip H. S., Ududec, Cozmin, Rocher, Luc, Mahdi, Adam

arXiv.org Artificial Intelligence

Evaluating large language models (LLMs) is crucial for both assessing their capabilities and identifying safety or robustness issues prior to deployment. Reliably measuring abstract and complex phenomena such as 'safety' and 'robustness' requires strong construct validity, that is, having measures that represent what matters to the phenomenon. With a team of 29 expert reviewers, we conduct a systematic review of 445 LLM benchmarks from leading conferences in natural language processing and machine learning. Across the reviewed articles, we find patterns related to the measured phenomena, tasks, and scoring metrics which undermine the validity of the resulting claims. To address these shortcomings, we provide eight key recommendations and detailed actionable guidance to researchers and practitioners in developing LLM benchmarks.


Toward Understanding the Transferability of Adversarial Suffixes in Large Language Models

Ball, Sarah, Hasrati, Niki, Robey, Alexander, Schwarzschild, Avi, Kreuter, Frauke, Kolter, Zico, Risteski, Andrej

arXiv.org Artificial Intelligence

Discrete optimization-based jailbreaking attacks on large language models aim to generate short, nonsensical suffixes that, when appended onto input prompts, elicit disallowed content. Notably, these suffixes are often transferable -- succeeding on prompts and models for which they were never optimized. And yet, despite the fact that transferability is surprising and empirically well-established, the field lacks a rigorous analysis of when and why transfer occurs. To fill this gap, we identify three statistical properties that strongly correlate with transfer success across numerous experimental settings: (1) how much a prompt without a suffix activates a model's internal refusal direction, (2) how strongly a suffix induces a push away from this direction, and (3) how large these shifts are in directions orthogonal to refusal. On the other hand, we find that prompt semantic similarity only weakly correlates with transfer success. These findings lead to a more fine-grained understanding of transferability, which we use in interventional experiments to showcase how our statistical analysis can translate into practical improvements in attack success.
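Once a refusal direction has been estimated, the three statistics can be computed directly from hidden-state vectors. A toy sketch of the computation (the function and variable names are ours, and the inputs here are placeholder vectors; real use requires extracting hidden states from an actual model):

```python
import numpy as np

def refusal_stats(h_prompt, h_prompt_suffix, r):
    """Given hidden states for a prompt alone and the prompt with a suffix,
    and an estimated 'refusal direction' r, return the three statistics
    described in the abstract (our naming, for illustration)."""
    r = r / np.linalg.norm(r)
    # (1) how much the bare prompt activates the refusal direction
    a_prompt = float(h_prompt @ r)
    delta = h_prompt_suffix - h_prompt
    # (2) how strongly the suffix pushes *away* from the refusal direction
    push = float(-(delta @ r))
    # (3) magnitude of the shift orthogonal to the refusal direction
    ortho = float(np.linalg.norm(delta - (delta @ r) * r))
    return a_prompt, push, ortho
```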


The "Right" Discourse on Migration: Analysing Migration-Related Tweets in Right and Far-Right Political Movements

Chatterjee, Nishan, Bajt, Veronika, Vitez, Ana Zwitter, Pollak, Senja

arXiv.org Artificial Intelligence

The rise of right-wing populism in Europe has brought to the forefront the significance of analysing social media discourse to understand the dissemination of extremist ideologies and their impact on political outcomes. Twitter, as a platform for interaction and mobilisation, provides a unique window into the everyday communication of far-right supporters. In this paper, we propose a methodology that uses state-of-the-art natural language processing techniques with sociological insights to analyse the MIGR-TWIT corpus of far-right tweets in English and French. We aim to uncover patterns of discourse surrounding migration, hate speech, and persuasion techniques employed by right and far-right actors. By integrating linguistic, sociological, and computational approaches, we seek to offer cross-disciplinary insights into societal dynamics and contribute to a better understanding of contemporary challenges posed by right-wing extremism on social media platforms.


AI-Generated Text Detection in Low-Resource Languages: A Case Study on Urdu

Ammar, Muhammad, Hadi, Hadiya Murad, Butt, Usman Majeed

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are now capable of generating text that closely resembles human writing, making them powerful tools for content creation, but this growing ability has also made it harder to tell whether a piece of text was written by a human or by a machine. This challenge is even more serious for languages like Urdu, where very few tools are available to detect AI-generated text. To address this gap, we propose a novel AI-generated text detection framework tailored to the Urdu language. A balanced dataset comprising 1,800 human-authored and 1,800 AI-generated texts, sourced from models such as Gemini, GPT-4o-mini, and Kimi AI, was developed. Detailed linguistic and statistical analysis was conducted, focusing on features such as character and word counts, vocabulary richness (type-token ratio), and n-gram patterns, with significance evaluated through t-tests and Mann-Whitney U tests. Three state-of-the-art multilingual transformer models, mdeberta-v3-base, distilbert-base-multilingual-cased, and xlm-roberta-base, were fine-tuned on this dataset. mDeBERTa-v3-base achieved the highest performance, with an F1-score of 91.29 and an accuracy of 91.26% on the test set. This research advances efforts to counter misinformation and academic misconduct in Urdu-speaking communities and contributes to the broader development of NLP tools for low-resource languages.
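The surface features described, character and word counts, type-token ratio, and n-gram patterns, are straightforward to compute. A minimal illustration of such feature extraction (whitespace tokenization is a simplification here; real Urdu text would need a dedicated tokenizer):

```python
from collections import Counter

def text_features(text, n=2):
    """Compute simple stylometric features of the kind the study compares
    between human-authored and AI-generated texts."""
    tokens = text.split()
    types = set(tokens)
    # type-token ratio: distinct words over total words (vocabulary richness)
    ttr = len(types) / len(tokens) if tokens else 0.0
    # n-gram counts (bigrams by default)
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {
        "char_count": len(text),
        "word_count": len(tokens),
        "type_token_ratio": ttr,
        "top_bigrams": ngrams.most_common(3),
    }
```

Feature vectors like these, computed per text, would then feed the significance tests and serve as a baseline alongside the fine-tuned transformers.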



Learning Analytics from Spoken Discussion Dialogs in Flipped Classroom

Su, Hang, Dzodzo, Borislav, Li, Changlun, Zhao, Danyang, Geng, Hao, Li, Yunxiang, Jaggi, Sidharth, Meng, Helen

arXiv.org Artificial Intelligence

The flipped classroom is a pedagogical strategy that has been gaining importance recently. Spoken discussion dialog commonly occurs in the flipped classroom and embeds rich information indicating the processes and progression of students' learning. This study focuses on learning analytics from spoken discussion dialogs in the flipped classroom, aiming to collect and analyze these dialogs in order to understand group learning processes and outcomes. We recently transformed a course using the flipped classroom strategy, where students watched video-recorded lectures at home prior to group-based problem-solving discussions in class. The in-class group discussions were recorded throughout the semester and then transcribed manually. After features were extracted from the dialogs with multiple tools and customized processing techniques, we performed statistical analyses to explore indicators in face-to-face discussion dialogs that relate to group learning outcomes. Machine learning algorithms were then applied to these indicators to predict the group learning outcome as High, Mid, or Low. The best prediction accuracy reaches 78.9%, which demonstrates the feasibility of automatically predicting learning outcomes from group discussion dialog in the flipped classroom. Learning analytics is concerned with the collection and analysis of data related to learning in order to inform and improve the learning process and its outcomes [1]. Properly applied, learning analytics can not only track student progress but also improve student performance [2]. Recent advancements in data science and machine learning techniques have led to a rise in the popularity of learning analytics within educational research.
The flipped classroom is a pedagogical method that assigns asynchronous video lectures and basic practice as homework and conducts group-based problem-solving discussions or activities in the classroom [3]. Since the flipped classroom promotes cooperative learning [4, 5] and increases student engagement and motivation [6, 7], it has gained importance for teaching and learning in recent years. A common in-class activity in the flipped classroom is student group discussion, where participants solve problems together. Such discussion dialogs embed rich information that cannot be captured objectively by conventional data, such as students' in-class sentiments, degree of concentration, and amount of information exchange. (Authors are with The Chinese University of Hong Kong, Shatin, N.T., Hong Kong.) Therefore, spoken discussion dialogs in the flipped classroom deserve greater attention for learning analytics, which aims to collect and analyze these dialogs in order to explore indicators that reflect group learning outcomes.
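The prediction step maps dialog-level features to a High/Mid/Low label. The abstract does not specify which classifiers were used, so purely as an illustration, here is a nearest-centroid sketch over hypothetical dialog features (e.g., turn count and speech rate):

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Fit one centroid per class. X: (n, d) feature matrix, y: labels."""
    classes = sorted(set(y))
    y = np.array(y)
    return {c: X[y == c].mean(axis=0) for c in classes}

def nearest_centroid_predict(centroids, x):
    """Assign x to the class whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))
```

In practice, each group discussion would be summarized as one feature vector and the classifier evaluated with cross-validation against the High/Mid/Low outcome labels.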